Abstract

In standard attractor neural network models, specific patterns of activity are stored in the synaptic matrix, so that they 
become fixed point attractors of the network dynamics. The storage capacity of such networks has been quantified in two 
ways: the maximal number of patterns that can be stored, and the stored information measured in bits per synapse. In this 
paper, we compute both quantities in fully connected networks of N binary neurons with binary synapses, storing patterns 
with coding level $f$, in the large $N$ and sparse coding limits ($N \to \infty$, $f \to 0$). We also derive finite-size corrections that 
accurately reproduce the results of simulations in networks of tens of thousands of neurons. These methods are applied to 
three different scenarios: (1) the classic Willshaw model, (2) networks with stochastic learning in which patterns are shown 
only once (one shot learning), (3) networks with stochastic learning in which patterns are shown multiple times. The storage 
capacities are optimized over network parameters, which allows us to compare the performance of the different models. We 
show that finite-size effects strongly reduce the capacity, even for networks of realistic sizes. We discuss the implications of 
these results for memory storage in the hippocampus and cerebral cortex. 
Introduction

Attractor neural networks have been proposed as long-term 
memory storage devices [1,2,3]. In such networks, a pattern of 
activity (the set of firing rates of all neurons in the network) is said 
to be memorized if it is one of the stable states of the network 
dynamics. Specific patterns of activity become stable states thanks 
to synaptic plasticity mechanisms, including both long term 
potentiation and depression of synapses, that create positive 
feed-back loops through the network connectivity. Attractor states 
are consistent with the phenomenon of selective persistent activity 
during delay periods of delayed response tasks, which has been 
documented in numerous cortical areas in behaving monkeys 
[4,5,6,7]. A long-standing question in the field has been the 
storage capacity of such networks. Much effort has 
been devoted to computing the number of attractor states that can 
be imprinted in the synaptic matrix, in networks of binary neurons 
[8,9,10,11]. Models storing patterns with a covariance rule 
[12,1,8,11] were shown to be able to store a number of patterns 
that scales linearly with the number of synapses per neuron. In the 
sparse coding limit (in which the average fraction of selective 
neurons per pattern $f$ goes to zero in the large $N$ limit), the 
capacity was shown to diverge as $1/(f|\log(f)|)$. These scalings 
lead to a network storing on the order of 1 bit per synapse, in the 
large $N$ limit, for any value of the coding level. Elizabeth Gardner 
[10] computed the maximal capacity, in the space of all possible 
coupling matrices, and demonstrated a similar scaling for capacity 
and information stored per synapse. 



These initial studies, performed on the simplest possible 
networks (binary neurons, full connectivity, unrestricted synaptic 
weights) were followed by a second wave of studies that examined 
the effect of adding more neurobiological realism: random diluted 
connectivity [9], neurons characterized by analog firing rates [13], 
learning rules in which new patterns progressively erase the old 
ones [14,15]. The above-mentioned modifications were shown not 
to affect the scaling laws described above. One particular 
modification, however, was shown to have a drastic effect on 
capacity. A network with binary synapses and stochastic on-line 
learning was shown to have a drastically impaired performance 
compared to networks with continuous synapses [16,17]. For finite 
coding levels, the storage capacity was shown to be on the order of 
$\sqrt{N}$ stored patterns, while the information stored per 
synapse goes to zero in the large $N$ limit. In the sparse coding limit 
however ($f \sim \log(N)/N$), the capacity was shown to scale as $1/f^2$, 
and therefore to follow a similar scaling as the Gardner bound, while the 
information stored per synapse remains finite in this limit. These 
scaling laws are similar to those of the Willshaw model [18], which can be 
seen as a particular case of the Amit-Fusi [17] rule. The model was 
subsequently studied in greater detail by Huang and Amit 
[19,20], who computed the storage capacity for finite values of $N$, 
using numerical simulations and several approximations for the 
distributions of the 'local fields' of the neurons. However, 
computing the precise storage capacity of this model in the large 
$N$ limit remains an open problem. 

In this article we focus on a model of binary neurons where 
binary synapses are potentiated or depressed stochastically 
Author Summary

Two central hypotheses in neuroscience are that long-term 
memory is sustained by modifications of the connectivity 
of neural circuits, while short-term memory is sustained by 
persistent neuronal activity following the presentation of a 
stimulus. These two hypotheses have been substantiated 
by several decades of electrophysiological experiments, 
reporting activity-dependent changes in synaptic connectivity 
in vitro, and stimulus-selective persistent neuronal 
activity in delayed response tasks in behaving monkeys. 
They have been implemented in attractor network models, 
that store specific patterns of activity using Hebbian 
plasticity rules, which then allow retrieval of these patterns 
as attractors of the network dynamics. A long-standing 
question in the field is how many patterns (or equivalently, 
how much information) can be stored in such networks? 
Here, we compute the storage capacity of networks of 
binary neurons and binary synapses. Synapses store 
information according to a simple stochastic learning 
process that consists of transitions between synaptic states 
conditioned on the states of pre- and post-synaptic 
neurons. We consider this learning process in two limits: 
a one shot learning scenario, where each pattern is 
presented only once, and a slow learning scenario, where 
noisy versions of a set of patterns are presented multiple 
times, but transition probabilities are small. The two limits 
are assumed to represent, in a simplified way, learning in 
the hippocampus and neocortex, respectively. We show 
that in both cases, the information stored per synapse 
remains finite in the large N limit, when the coding is 
sparse. Furthermore, we characterize the strong finite size 
effects that exist in such networks. 



are presented during the learning phase. The state of neuron 
$i = 1, \ldots, N$ in pattern $\mu = 1, \ldots, P$ is 

$$\xi_i^{\mu} = \begin{cases} 1 & \text{with probability } f \\ 0 & \text{with probability } 1-f \end{cases} \tag{1}$$

where $f$ is the coding level of the memories. We study this model 
in the limit of low coding level, $f \to 0$ when $N \to \infty$. In all the 
models considered here, $P$ scales as $1/f^2$ in the sparse coding limit. 
Thus, we introduce a parameter $\alpha = P f^2$ which stays of order 1 in 
the sparse coding limit. 

After the learning phase, we choose one of the $P$ presented 
patterns $\xi^{\mu_0}$, and check whether it is a fixed point of the dynamics: 

$$\sigma_i(t+1) = \Theta\left[h_i(t) - fN\theta\right], \tag{2}$$

where 

$$h_i(t) = \sum_{j=1}^{N} W_{ij}\,\sigma_j(t) \tag{3}$$

is the total synaptic input ("field") of neuron $i$, $\theta$ is a scaled 
activation threshold (constant independent of $N$), and $\Theta$ is the 
Heaviside function. 
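The fixed-point test defined by the update rule and the field above can be sketched in a few lines. This is a minimal illustration, not the simulation code behind the results: the function names `update` and `is_fixed_point`, the toy sizes, and the convention that a neuron fires only when its field strictly exceeds the threshold (the Heaviside function taken to be 0 at 0) are assumptions made for the example.

```python
import numpy as np

def update(W, sigma, f, theta):
    """One synchronous step: sigma_i(t+1) = Theta[h_i(t) - f*N*theta]."""
    N = W.shape[0]
    h = W @ sigma                            # total synaptic input ("field") of each neuron
    return (h > f * N * theta).astype(int)   # assumption: Theta(0) = 0

def is_fixed_point(W, xi, f, theta):
    """True if pattern xi is left unchanged by one update step."""
    return np.array_equal(update(W, xi, f, theta), xi)

# Toy example (hypothetical sizes): one pattern with 5 active neurons out of 20,
# imprinted by setting W_ij = 1 whenever neurons i and j are both active.
N, f, theta = 20, 0.25, 0.5
xi = np.zeros(N, dtype=int)
xi[:5] = 1
W = np.outer(xi, xi)
stored = is_fixed_point(W, xi, f, theta)
```

Here the active neurons receive a field of 5 and the silent ones a field of 0, against a threshold of $fN\theta = 2.5$, so the imprinted pattern is stable.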

Field averages. When testing the stability of pattern $\xi^{\mu_0}$ after 
learning $P$ patterns, we need to compute the distribution of the 
fields on selective neurons (sites $i$ such that $\xi_i^{\mu_0} = 1$), and of the 
fields on non-selective neurons (sites $i$ such that $\xi_i^{\mu_0} = 0$). The 
averages of those fields are $fNg_+$ and $fNg$ respectively, where 



depending on the states of pre and post synaptic neurons [17]. We 
first introduce analytical methods that allow us to compute the 
storage capacity in the large N limit, based on a binomial 
approximation for the synaptic inputs to the neurons. We first 
illustrate it on the Willshaw model and recover the well-known 
result on the capacity of this model [18,21,22]. We then move to a 
stochastic learning rule, in which we study two different scenarios: 
(i) in which patterns are presented only once - we will refer to this 
model as the SP (Single Presentation) model [17]; (ii) in which 
noisy versions of the patterns are presented multiple times - the 
MP (Multiple Presentations) model [23]. For both models we 
compute the storage capacity and the information stored per 
synapse in the large $N$ limit, and investigate how they depend on 
the various parameters of the model. We then study finite-size 
effects, and show that they have a huge effect even in networks of 
tens of thousands of neurons. Finally we show how capacity in 
finite-size networks can be enhanced by introducing inhibition, as 
proposed in [19,20]. In the discussion we summarize our results 
and discuss the relevance of the SP and MP networks to memory 
maintenance in the hippocampus and cortex. 

Results 

Storage capacity in the $N \to \infty$ limit 

The network. We consider a network of $N$ binary (0,1) 
neurons, fully connected through a binary (0,1) synaptic 
connectivity matrix. The activity of neuron $i$ ($i = 1 \ldots N$) is 
described by a binary variable, $\sigma_i = 0,1$. Each neuron can 
potentially be connected to every other neuron, through a binary 
connectivity matrix $W$. This connectivity matrix depends on $P$ 
random uncorrelated patterns ('memories') $\xi^{\mu}$, $\mu = 1, \ldots, P$ that 



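$$g_+ = P\left(W_{ij} = 1 \mid \xi_i^{\mu_0} = \xi_j^{\mu_0} = 1\right) \tag{4}$$

and 

$$g = P\left(W_{ij} = 1 \mid (\xi_i^{\mu_0}, \xi_j^{\mu_0}) \neq (1,1)\right). \tag{5}$$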



Pattern $\xi^{\mu_0}$ is perfectly imprinted in the synaptic matrix if 
$g_+ = 1$ and $g = 0$. However, because of the storage of other 
patterns, $g_+$ and $g$ take intermediate values between 0 and 1. Note 
that here we implicitly assume that the probability of finding a 
potentiated synapse between two neurons $i,j$ such that 
$\xi_i^{\mu_0} = \xi_j^{\mu_0} = 0$ or $\xi_i^{\mu_0} \neq \xi_j^{\mu_0}$ is the same. This is true for the models 
we consider below. $g_+$ and $g$ are functions of $\alpha$, $f$, and other 
parameters characterizing learning. 

Information stored per synapse. One measure of the 
storage capability of the network is the information stored per 
synapse: 



$$i = \frac{P_{max}\, N\left(-f \log_2 f - (1-f)\log_2(1-f)\right)}{N^2} \tag{6}$$

$$\simeq \frac{\alpha\,|\log_2 f|}{fN} \tag{7}$$



where $P_{max}$ is the size of a set of patterns in which each pattern is a 
fixed point of the dynamics with probability one. When $\alpha$ is of 
order one, for the information per synapse to be of order one in 
the large $N$ limit, we need to take $f$ as 

$$f \sim \beta\,\frac{\ln N}{N} \tag{8}$$



In this case the information stored per synapse has the simple 
expression: 



$$i = \frac{\alpha}{\beta \ln 2} \tag{9}$$



$$X_n = -\beta\,\Phi(g,\theta)\ln N + \ln N + o(\ln N). \tag{15}$$



For $P_{ne}$ to go to 1 in the large $N$ limit, we need both $X_s$ and $X_n$ 
to go to $-\infty$ in that limit. This will be satisfied provided 

$$\Phi(g_+,\theta) > \frac{1}{\beta}\,\frac{\ln\ln N}{\ln N} \tag{16}$$



Computing the storage capacity. Our goal here is to 
compute the size $P_{max} = \alpha/f^2$ of the largest set of patterns that can 
be stored in the connectivity matrix. The criterion for storage that 
we adopt is that if one picks a pattern in this set, then this pattern is 
a fixed point of the dynamics with probability 1. We thus need to 
compute the probability $P_{ne}$ of no error in retrieving a particular 
pattern $\mu_0$. To compute this probability, we first need to estimate 
the probabilities that a single selective/non-selective neuron is in 
its right state when the network is initialized in a state 
corresponding to pattern $\mu_0$. For a pattern with $M$ selective 
neurons, and neglecting correlations between neurons (which is 
legitimate if $f \ll 1/\sqrt{N}$ [17]), we have 



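$$P_{ne} = \left(1 - P\left(h_i \le fN\theta \mid \xi_i^{\mu_0} = 1\right)\right)^{M} \left(1 - P\left(h_i > fN\theta \mid \xi_i^{\mu_0} = 0\right)\right)^{N-M} \tag{10}$$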



Clearly, for $P_{ne}$ to go to 1 in the large $N$ limit, the probabilities 
for the fields of single neurons to be on the wrong side of the 
threshold have to vanish in that limit. A first condition for this to 
happen is $g_+ > \theta > g$ - if these inequalities are satisfied, then the 
average fields of both selective and non-selective neurons are on 
the right side of the threshold. When $g_+$ and $g$ are sufficiently far 
from $\theta$, the tail probabilities of the distribution of the fields are 



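$$P\left(h_i \le fN\theta \mid \xi_i^{\mu_0} = 1\right) = \exp\left(-M\,\Phi(g_+,\theta) + o(M)\right) \tag{11}$$

$$P\left(h_i > fN\theta \mid \xi_i^{\mu_0} = 0\right) = \exp\left(-M\,\Phi(g,\theta) + o(M)\right) \tag{12}$$

where $\Phi(g_+,\theta)$, $\Phi(g,\theta)$ are the rate functions associated with the 
distributions of the fields (see Methods). Neglecting again 
correlations between inputs, the distributions of the fields are 
binomial distributions, and the rate functions are 

$$\Phi(x,\theta) = \theta \ln\frac{\theta}{x} + (1-\theta)\ln\frac{1-\theta}{1-x} \tag{13}$$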
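The role of the rate function can be checked numerically against an exact binomial tail. The sketch below is illustrative only: the helper names `rate` and `upper_tail` and the values of $M$, $g$ and $\theta$ are assumptions chosen for the example, and the field of a non-selective neuron is taken to be exactly binomial, as in the approximation described above.

```python
import math

def rate(x, theta):
    """Phi(x, theta) = theta*ln(theta/x) + (1-theta)*ln((1-theta)/(1-x)), with 0*ln(0) = 0."""
    def term(p, q):
        return 0.0 if p == 0.0 else p * math.log(p / q)
    return term(theta, x) + term(1.0 - theta, 1.0 - x)

def upper_tail(M, g, k):
    """Exact P(Binomial(M, g) >= k)."""
    return sum(math.comb(M, j) * g**j * (1 - g)**(M - j) for j in range(k, M + 1))

# Hypothetical values: M inputs, each potentiated with probability g,
# and a threshold placed at k = M * theta with theta > g.
M, g, k = 200, 0.3, 120
theta = k / M

exact = upper_tail(M, g, k)                 # exact probability of a wrong-side field
bound = math.exp(-M * rate(g, theta))       # leading-order estimate exp(-M * Phi)
empirical_rate = -math.log(exact) / M       # approaches Phi(g, theta) as M grows
```

The exact tail lies below `bound` (this is the Chernoff bound), and `empirical_rate` differs from $\Phi(g,\theta)$ only by a correction of order $\ln M / M$, which is the $o(M)$ term in the exponent.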



Inserting Eqs. (11,12,13,8) in Eq. (10), we find that 

$$P_{ne} = \exp\left[-\exp(X_s) - \exp(X_n)\right] \tag{14}$$

where 



$$\Phi(g,\theta) > \frac{1}{\beta} \tag{17}$$



These inequalities are equivalent in the large $N$ limit to the 
inequalities 

$$g_+ > \theta > g + \zeta \tag{18}$$

where $\zeta$ is given by the equation $\Phi(g + \zeta, \theta) = 1/\beta$. 

The maximal information per synapse is obtained by saturating 
inequalities (16) and (17), and optimizing over the various 
parameters of the model. In practice, for given values of $\alpha$, and 
parameters of the learning process, we compute $g$ and $g_+$; we can 
then obtain the optimal values of the threshold $\theta$ and the rescaled 
coding level $\beta$ as 

$$\theta \underset{N \to +\infty}{\longrightarrow} g_+ \tag{19}$$

$$\beta \underset{N \to +\infty}{\longrightarrow} \frac{1}{\Phi(g, g_+)} \tag{20}$$

and compute the information per synapse using Eq. (9). We can 
then find the optimum of $i$ in the space of all parameters. 

Before applying these methods to various models, we would like 
to emphasize two important features of these calculations: 

• In Eq. (16), note that the r.h.s. goes to zero extremely slowly as 
$N$ goes to $\infty$ (as $\ln\ln N / \ln N$) - thus, we expect huge finite size 
effects. This will be confirmed in section 'Finite-size networks' 
where these finite size effects are studied in detail. 

• In the sparse coding limit, a Gaussian approximation of the 
fields gives a poor approximation of the storage capacity, since 
the calculation probes the tail of the distribution. 
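To give a feel for how slowly the factor $\ln\ln N / \ln N$ decays, the short sketch below evaluates it for a few network sizes. The function name `slow_factor` and the chosen sizes are illustrative assumptions, not values taken from the analysis.

```python
import math

def slow_factor(N):
    """ln(ln N) / ln N, the factor controlling the leading finite-size correction."""
    return math.log(math.log(N)) / math.log(N)

sizes = [1e4, 1e6, 1e10, 1e100]
factors = [slow_factor(N) for N in sizes]
for N, r in zip(sizes, factors):
    print(f"N = {N:.0e}: ln ln N / ln N = {r:.3f}")
```

The factor is still about 0.24 at $N = 10^4$ and only drops to roughly 0.02 at $N = 10^{100}$, which is why corrections of this order remain visible at any realistic network size.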



Willshaw model 

The capacity of the Willshaw model has already been studied by 
a number of authors [18,21,22]. Here, we present the application 
of the analysis described in the previous section to the Willshaw 
model, for completeness and comparison with the models 
described in the next sections. In this model, after presenting P 
patterns to the network, the synaptic matrix is described as follows: 
Wij = 1 if at least one of the P presented patterns had neuron i and 
j co-activated, Wij = 0 otherwise. Thus, after the learning phase, 
we have. 



$$X_s = -\beta\,\Phi(g_+,\theta)\ln N + \ln\ln N + o(\ln\ln N)$$



PLOS Computational Biology | www.ploscompbiol.org | August 2014 | Volume 10 | Issue 8 | e1003727 

Memory Capacity of Networks with Stochastic Binary Synapses 



$$g = 1 - (1 - f^2)^P \simeq 1 - \exp(-\alpha) \quad \text{for small } f \tag{21}$$
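The fraction of potentiated synapses in the Willshaw model, $g = 1 - (1 - f^2)^P \simeq 1 - \exp(-\alpha)$ with $\alpha = P f^2$, can be compared with a direct simulation of the learning rule. The following is a small sketch under assumed toy parameters (`N`, `f`, `P` and the random seed are arbitrary choices for illustration, far from the sparse large-$N$ regime studied in the text).

```python
import numpy as np

rng = np.random.default_rng(0)
N, f, P = 400, 0.05, 200                          # hypothetical toy sizes

# xi[mu, i] = 1 with probability f, independently for every neuron and pattern
patterns = (rng.random((P, N)) < f).astype(int)

# Willshaw rule: W_ij = 1 if neurons i and j were co-active in at least one pattern
W = (patterns.T @ patterns > 0).astype(int)

off_diag = ~np.eye(N, dtype=bool)                 # ignore self-connections
g_sim = W[off_diag].mean()                        # measured fraction of potentiated synapses
g_exact = 1 - (1 - f**2) ** P                     # 1 - (1 - f^2)^P
g_sparse = 1 - np.exp(-P * f**2)                  # sparse-coding form 1 - exp(-alpha)
```

With these values $\alpha = 0.5$, and the three estimates of $g$ should agree to within sampling fluctuations.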



Saturating the inequalities (19,20) with g fixed, one obtains the 
information stored per synapse, 



$$i_W = \frac{1}{\ln 2}\,\ln(1-g)\,\ln g \tag{22}$$



The information stored per synapse is shown as a function of $g$ 
in Figure 1a. It reaches a maximum for $g = 0.5$ at 
$i_W = \ln 2 = 0.69$ bits/synapse, and goes to zero in both the $g \to 0$ 
and $g \to 1$ limits. The model has a storage capacity comparable to 
its maximal value, $i_{opt} > 0.5\, i_W$, in a large range of values of $g$ 
(between 0.1 and 0.9). We can also optimize capacity for a given 
value of $\beta$, as shown in Figure 1b. It reaches its maximum at 
$\beta = 1.4$, and goes to zero in the small and large $\beta$ limits. Again, the 
model has a large storage capacity for a broad range of $\beta$, 
$i_{opt} > 0.5\, i_W$ for $\beta$ between 0.4 and 10. 
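The dependence of the Willshaw information on $g$ can be reproduced with a one-line formula, assuming the information per synapse takes the form $i_W(g) = \ln(1-g)\ln(g)/\ln 2$ (the threshold set to $\theta = g_+ = 1$ and $\beta = -1/\ln g$). The function name and the grid resolution below are illustrative choices.

```python
import math

def willshaw_information(g):
    """Information per synapse (bits) for a fraction g of potentiated synapses."""
    return math.log(1 - g) * math.log(g) / math.log(2)

# Scan g on a regular grid (hypothetical resolution) and locate the maximum.
grid = [k / 1000 for k in range(1, 1000)]
g_best = max(grid, key=willshaw_information)
i_best = willshaw_information(g_best)
beta_best = -1 / math.log(g_best)    # rescaled coding level beta = fN / ln N at the optimum
```

The scan returns `g_best = 0.5`, where the information equals $\ln 2 \approx 0.69$ bits/synapse and $\beta = 1/\ln 2 \approx 1.44$.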

Previous studies [18,21] have found an optimal capacity of 
0.69bits/synapse. Those studies focused on a feed-forward 
network with a single output neuron, with no fluctuations in the 
number of selective neurons per pattern, and required that the 
number of errors on silent outputs is of the same order as the 
number of selective outputs in the whole set of patterns. In the 
calculations presented here, we have used a different criterion, 
namely that a given pattern (not all patterns) is exactly a fixed 
point of the dynamics of the network with a probability that goes 
to one in the large N limit. Another possible definition would be to 
require that all the P patterns are exact fixed points with 
probability one. In this case, for patterns with fixed numbers of 
selective neurons, the capacity drops by a factor of 3, 
ln(2)/3 = 0.23 bits/synapse, as already computed by Knoblauch 
et al [22]. 

Amit-Fusi model 

A drawback of the Willshaw learning rule is that it only allows 
for synaptic potentiation. Thus, if patterns are continuously 
presented to the network, all synapses will eventually be 
potentiated and no memories can be retrieved. In [17] Amit and 
Fusi introduced a new learning rule that maintains the simplicity 
of the Willshaw model, but allows for continuous on-line learning. 
The proposed learning rule includes synaptic depression. At each 
learning time step $\mu$, a new pattern $\xi^{\mu}$ with coding level $f$ is 
presented to the network, and synapses are updated stochastically: 




Figure 1. Optimized information capacity of the Willshaw 
model in the limit $N \to +\infty$. Information is optimized by saturating 
(19) ($\theta = 1$) and (20): a. $i_{opt}$ as a function of $g$, b. $i_{opt}$ as a function of 
$\beta = fN/\ln N$. 
doi:10.1371/journal.pcbi.1003727.g001 



• for synapses such that $\xi_i^{\mu} = \xi_j^{\mu} = 1$: 
if $W_{ij}(\mu-1) = 0$, then $W_{ij}(\mu)$ is potentiated to 1 with probability 
$q_+$; and if $W_{ij}(\mu-1) = 1$ it stays at 1. 

• for synapses such that $\xi_i^{\mu} \neq \xi_j^{\mu}$: 
if $W_{ij}(\mu-1) = 0$, then $W_{ij}(\mu)$ stays at 0; and if $W_{ij}(\mu-1) = 1$ it is 
depressed to 0 with probability $q_-$. 

• for synapses such that $\xi_i^{\mu} = \xi_j^{\mu} = 0$, $W_{ij}(\mu) = W_{ij}(\mu-1)$. 

The evolution of a synapse Wij during learning can be described 
by the following Markov process: 



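$$\begin{pmatrix} P(W_{ij}(\mu+1) = 0) \\ P(W_{ij}(\mu+1) = 1) \end{pmatrix} = \begin{pmatrix} 1-a & b \\ a & 1-b \end{pmatrix} \begin{pmatrix} P(W_{ij}(\mu) = 0) \\ P(W_{ij}(\mu) = 1) \end{pmatrix} \tag{23}$$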



where $a = f^2 q_+$ is the probability that a silent synapse is 
potentiated upon the presentation of pattern $\mu$ and 
$b = 2f(1-f)q_-$ is the probability that a potentiated synapse is 
depressed. After a sufficient number of patterns has been 
presented, the distribution of synaptic weights in the network 
reaches a stationary state. We study the network in this stationary 
regime. 
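The approach to the stationary state can be illustrated by iterating the two-state Markov chain for a single synapse. This is a sketch under stated assumptions: the per-pattern potentiation and depression probabilities are taken as $a = f^2 q_+$ and $b = 2f(1-f)q_-$, and the function names and parameter values are hypothetical choices for the example.

```python
def synaptic_rates(f, q_plus, q_minus):
    """Per-pattern transition probabilities of a single binary synapse."""
    a = f**2 * q_plus                 # silent -> potentiated (both neurons active)
    b = 2 * f * (1 - f) * q_minus     # potentiated -> silent (exactly one neuron active)
    return a, b

def potentiated_fraction(a, b, steps, g0=0.0):
    """Iterate P(W = 1) over `steps` pattern presentations, starting from g0."""
    g = g0
    for _ in range(steps):
        g = g * (1 - b) + (1 - g) * a
    return g

a, b = synaptic_rates(f=0.1, q_plus=1.0, q_minus=0.1)   # hypothetical values
g_iter = potentiated_fraction(a, b, steps=2000)
g_stat = a / (a + b)                                     # stationary fraction
```

The iteration forgets its initial condition at rate $(1 - a - b)$ per pattern and converges to $a/(a+b)$, which depends only on the ratio $b/a$.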

For the information capacity to be of order 1, the coding level 
has to scale as $\ln N / N$, as in the Willshaw model, and the effects of 
potentiation and depression have to be of the same order [17]. 
Thus we define the depression-potentiation ratio $\delta$ as 

$$\delta = \frac{2f(1-f)\,q_-}{f^2\,q_+} \tag{24}$$



We can again use Eq. (9) and the saturated inequalities (19,20) 
to compute the maximal information capacity in the limit $N \to \infty$. 
This requires computing $g$ and $g_+$, defined in the previous section, 
as a function of the different parameters characterizing the 
network. We track a pattern $\mu_0$ that has been presented $P$ time 
steps in the past. In the following we refer to $P$ as the age of the 
pattern. In the sparse coding limit, $g$ corresponds to the probability 
that a synapse is potentiated. It is determined by the 
depression-potentiation ratio $\delta$, 

$$g = \frac{a}{a+b} = \frac{1}{1+\delta} \tag{25}$$

and 

$$g_+ = g + q_+(1-g)(1-a-b)^P \underset{f \to 0}{\longrightarrow} g + q_+(1-g)\exp\left(-\alpha q_+(1+\delta)\right) \tag{26}$$

where $\alpha = Pf^2$. Our goal is to determine the age $P$ of the oldest 
pattern that is still a fixed point of the network dynamics, with 
probability one. Note that in this network, contrary to the 
Willshaw model in which all patterns are equivalent, 
younger patterns, of age $P' < P$, are more strongly imprinted in 
the synaptic matrix, $g_+(P') > g_+(P)$, and thus also stored with 
probability one. 
Figure 2. Optimized information capacity for the SP model in 
the limit $N \to +\infty$. a. $i_{opt}$ as a function of $g$, b. $i_{opt}$ as a function of $\delta$, 
the ratio between the number of depressing events and potentiating 
events at pattern presentation, c. $i_{opt}$ as a function of $\beta = fN/\ln N$, d. 
$i_{opt}$ as a function of the LTP transition probability $q_+$. 
doi:10.1371/journal.pcbi.1003727.g002 



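$$P(\tilde{\xi}_i^{\mu,t} = 1) = 1 - (1-f)x \ \text{ and } \ P(\tilde{\xi}_i^{\mu,t} = 0) = (1-f)x \quad \text{for } \xi_i^{\mu} = 1$$

$$P(\tilde{\xi}_i^{\mu,t} = 1) = fx \ \text{ and } \ P(\tilde{\xi}_i^{\mu,t} = 0) = 1 - fx \quad \text{for } \xi_i^{\mu} = 0 \tag{28}$$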



Here $x$ is a noise level: if $x = 0$, presented patterns are identical 
to the prototypes, while if $x = 1$, the presented patterns are 
uncorrelated with the prototypes. As for the SP model, this model 
achieves a finite non-zero information capacity $i_{opt}$ in the large $N$ 
limit if the depression-potentiation ratio $\delta$ is of order one, and if 
the coding level scales with network size as $f \propto \ln N / N$. If learning is 
slow, and the number of presentations of patterns of 
each class becomes large, the probabilities $g$ and $g_+$ are [23]: 



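$$g = \sum_{n=0}^{\infty} \frac{(1-x)^2\, n + \alpha x(2-x)}{(1-x)^2\, n + \alpha\left(\delta + x(2-x)\right)}\;\frac{\alpha^n \exp(-\alpha)}{n!} \tag{29}$$

and 

$$g_+ = \sum_{n=0}^{\infty} \frac{(1-x)^2 (n+1) + \alpha x(2-x)}{(1-x)^2 (n+1) + \alpha\left(\delta + x(2-x)\right)}\;\frac{\alpha^n \exp(-\alpha)}{n!} \tag{30}$$

Choosing an activation threshold and a coding level that 
saturate inequalities (19) and (20), information capacity can be 
expressed as: 

$$i_{SP} = \alpha\left[g_+ \log_2\frac{g_+}{g} + (1-g_+)\log_2\frac{1-g_+}{1-g}\right]$$

$$= \frac{\alpha}{1+\delta}\left[\left(1 + \delta q_+ e^{-\alpha(1+\delta)q_+}\right)\log_2\left(1 + \delta q_+ e^{-\alpha(1+\delta)q_+}\right) + \delta\left(1 - q_+ e^{-\alpha(1+\delta)q_+}\right)\log_2\left(1 - q_+ e^{-\alpha(1+\delta)q_+}\right)\right] \tag{27}$$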



The optimal information $i_{SP} = 0.083$ bits/synapse is reached 
for $q_+ = 1$, $\theta = 0.72$, $\beta = 2.44$, $\alpha = 0.14$, $\delta = 2.57$ which gives 
$g = 0.28$, $g_+ = 0.72$. 

The dependence of $i_{opt}$ on the different parameters is shown in 
Figure 2. Panel a shows the dependence on $g$, the fraction of 
activated synapses in the asymptotic learning regime. Panels b, c 
and d show the dependence on $\delta$, $\beta$ and $q_+$. Note from panel c 
that there is a broad range of values of $\beta$ that give information 
capacities similar to the optimal one. One can also observe that the 
optimal information capacity is about 9 times lower in the SP 
model than in the Willshaw model. This is the price one pays to 
have a network that is able to continuously learn new patterns. 
However, it should be noted that at maximal capacity, in the 
Willshaw model, every pattern has a vanishing basin of attraction 
while in the SP model, only the oldest stable patterns have 
vanishing basins of attraction. This feature is not captured by our 
measure of storage capacity. 
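The quoted optimum of the SP model can be checked numerically. The sketch below assumes the large-$N$ expressions $g = 1/(1+\delta)$, $g_+ = g + q_+(1-g)\exp(-\alpha q_+(1+\delta))$, a threshold saturated at $\theta = g_+$, $\beta = 1/\Phi(g, g_+)$ and $i = \alpha/(\beta \ln 2)$; the function names are hypothetical.

```python
import math

def rate(x, theta):
    """Binomial rate function Phi(x, theta), with the convention 0*ln(0) = 0."""
    def term(p, q):
        return 0.0 if p == 0.0 else p * math.log(p / q)
    return term(theta, x) + term(1.0 - theta, 1.0 - x)

def sp_information(alpha, delta, q_plus):
    """Return (g, g_plus, beta, information in bits/synapse) for the SP model."""
    g = 1.0 / (1.0 + delta)
    g_plus = g + q_plus * (1.0 - g) * math.exp(-alpha * q_plus * (1.0 + delta))
    beta = 1.0 / rate(g, g_plus)          # threshold set to theta = g_plus
    info = alpha / (beta * math.log(2))
    return g, g_plus, beta, info

# Parameter values quoted in the text for the optimum
g, g_plus, beta, info = sp_information(alpha=0.14, delta=2.57, q_plus=1.0)
print(f"g = {g:.2f}, g+ = {g_plus:.2f}, beta = {beta:.2f}, i = {info:.3f} bits/synapse")
```

With these inputs the function reproduces the values given above for $g$, $g_+$, $\beta$ and the information per synapse to within rounding.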

Multiple presentations of patterns, slow learning regime 

In the SP model, patterns are presented only once. Brunel et al. 
[23] studied the same network of binary neurons with stochastic 
binary synapses but in a different learning context, where patterns 
are presented multiple times. More precisely, at each learning time 
step $t$, a noisy version $\tilde{\xi}^{\mu,t}$ of one of the $P$ prototypes $\xi^{\mu}$ is 
presented to the network. 



We inserted those expressions in Eqs. (19,20) to study the 
maximal information capacity of the network under this learning 
protocol. The optimal information $i_{MP} = 0.69$ bits/synapse is 
reached at $x = 0$ for $\theta \to 1$, $\beta \to 1.44$, $\delta \to 0$, $\alpha \to 0.69$ which gives 
$g_+ \to 1$. In this limit, the network becomes equivalent to 
the Willshaw model. 

The maximal capacity is about 9 times larger than for a network 
that has to learn in one shot. In Figure 3a we plot the optimal 
capacity as a function of $g$. The capacity of the slow learning 
network with multiple presentations is bounded by the capacity of 
the Willshaw model for all values of $g$, and it is reached when the 
depression-potentiation ratio $\delta \to 0$. For this value, no depression 
occurs during learning: the network loses palimpsest properties, i.e. 
the ability to erase older patterns to store new ones, and it is not 
able to learn if the presented patterns are noisy. The optimal 
capacity decreases with $\delta$, for instance at $\delta = 1$ (as many 
potentiation events as depression events at each pattern 
presentation), $i_{opt} = 0.35$ bits/synapse. Figure 3c shows the dependence 
on $\beta = fN/\ln N$. In Figure 3d, we show the optimized 
capacity for different values of the noise $x$ in the presented 
patterns. This quantifies the trade-off between the storage capacity 
and the generalization ability of the network [23]. 
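The MP values quoted above can be explored with a short numerical sketch. It assumes that, in the slow-learning limit, $g$ and $g_+$ are Poisson-weighted sums of the form $\sum_n \frac{(1-x)^2 m + \alpha x(2-x)}{(1-x)^2 m + \alpha(\delta + x(2-x))} \frac{\alpha^n e^{-\alpha}}{n!}$ with $m = n$ for $g$ and $m = n+1$ for $g_+$, and that the information is $i = \alpha\,\Phi(g, g_+)/\ln 2$. The function names, the truncation `n_max` and the scan grid are illustrative assumptions.

```python
import math

def rate(x, theta):
    """Binomial rate function Phi(x, theta), with the convention 0*ln(0) = 0."""
    def term(p, q):
        return 0.0 if p == 0.0 else p * math.log(p / q)
    return term(theta, x) + term(1.0 - theta, 1.0 - x)

def mp_probabilities(alpha, delta, x, n_max=100):
    """Poisson-weighted sums for g and g_plus (assumed form, truncated at n_max terms)."""
    c = (1.0 - x) ** 2
    noise = alpha * x * (2.0 - x)
    g = g_plus = 0.0
    weight = math.exp(-alpha)                       # Poisson weight for n = 0
    for n in range(n_max):
        g += weight * (c * n + noise) / (c * n + noise + alpha * delta)
        g_plus += weight * (c * (n + 1) + noise) / (c * (n + 1) + noise + alpha * delta)
        weight *= alpha / (n + 1)                   # Poisson weight for n + 1
    return g, g_plus

def mp_information(alpha, delta, x):
    """Information per synapse (bits) with theta = g_plus and beta = 1/Phi(g, g_plus)."""
    g, g_plus = mp_probabilities(alpha, delta, x)
    return alpha * rate(g, g_plus) / math.log(2)

# Scan alpha at delta = 1, x = 0 (noise-free patterns, balanced depression and potentiation)
best_delta_one = max(mp_information(k / 100, 1.0, 0.0) for k in range(5, 200))

# Near the Willshaw limit: delta -> 0, x = 0, alpha = ln 2
near_willshaw = mp_information(math.log(2), 1e-6, 0.0)
```

Under these assumptions the scan at $\delta = 1$ gives a maximum close to the 0.35 bits/synapse quoted above, and the small-$\delta$ evaluation approaches $\ln 2 \approx 0.69$ bits/synapse.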

Finite-size networks 

The results we have presented so far are valid for infinite size 
networks. Finite-size effects can be computed for the three models 
we have discussed so far (see Methods). The main result of this 
section is that the capacity of networks of realistic sizes is very far 
from the large N limit. We compute capacities for finite networks 
in the SP and MP settings, and we validate our finite size 
calculations by presenting the results of simulations of large 
networks of sizes $N = 10,000$, $N = 50,000$. 
Figure 3. Optimized information capacity for the MP model in 
the limit $N \to +\infty$. a. Optimal information capacity as a function of 
$g$, the average number of activated synapses after learning. Optimal 
capacity is reached in the limit $\delta \to 0$ and at $x = 0$ where the capacity is 
the same as for the Willshaw model, b. Dependence of information 
capacity on $\delta$, the ratio between the number of depressing events and 
potentiating events at pattern presentation, c. Dependence on 
$\beta = fN/\ln N$. d. Dependence on the noise in the presented patterns, 
$x$. This illustrates the trade-off between the storage capacity and the 
generalization ability of the network. 
doi:10.1371/journal.pcbi.1003727.g003 



We summarize the finite size calculations for the SP model (a 
more general and detailed analysis is given in Methods). In the 
finite network setting, conditional on the tested pattern $\xi^{\mu_0}$ having 
$M+1$ selective neurons, the probability of no error $P_{ne}$ is given by 

$$P_{ne} = \exp\left[-\exp(X_s) - \exp(X_n)\right]$$



for $N = 10,000$ and $N = 50,000$ for learning and network parameters 
chosen to optimize the storage capacity of the infinite-size network 
(see Section 'Amit-Fusi model'). We show the result for two different 
approximations of the field distribution: a binomial distribution 
(magenta), as used in the previous calculations for infinite size 
networks; and a gaussian (red) approximation (see Methods for 
calculations) as used by previous authors [19,20,24]. For these 
parameters the binomial approximation gives an accurate estimation 
of $P_{ne}$, while the gaussian calculation overestimates it. 

The curves we get are far from the step functions predicted for 
$N \to +\infty$ by Eq. (45). To understand why, compare Eqs. (15), 
and (31): finite size effects can be neglected when 
$|-\beta\Phi(g_+,\theta)| \gg \frac{\ln\ln N}{\ln N}$ and $|-\beta\Phi(g,\theta) + 1| \gg \frac{\ln\ln N}{\ln N}$. Because 
the finite size effects are of order $\frac{\ln\ln N}{\ln N}$, it is only for huge values 

of $N$ that the asymptotic capacity can be recovered. For instance 
if we choose an activation threshold $\theta$ slightly above the 
optimal threshold given in Section 'Amit-Fusi model' 
($\theta = \theta_{opt} + 0.01 = 0.73$), then $-\beta\Phi(g,\theta) + 1 = -0.06$, and for 
$N = 10^{100}$ we only have $|-\beta\Phi(g,\theta) + 1| \simeq 3\,\frac{\ln\ln N}{\ln N}$. In Figure 4c 

we plot $P_{ne}$ as a function of pattern age rescaled using $\alpha_{opt}$, where $\alpha_{opt} = 0.14$ is the value of $\alpha$ 
that optimizes capacity in the large $N$ limit, $\theta = 0.73$ and the other 
parameters are the ones that optimize capacity. We see that we are 
still far from the large limit for A= 10""*. Networks of sizes 
lO'' — 10^ have capacities which are only between 20% and 40% 
of the predicted capacity in the large N limit. Neglecting 
fluctuations in the number of selective neurons, we can derive 
an expression for the number of stored patterns P that includes the 
leading finite size correction for the SP model. 



$$P(N) = c_1\,\frac{N^2}{(\ln N)^2}\left[1 - c_2\sqrt{\frac{\ln\ln N}{\ln N}} + o\left(\sqrt{\frac{\ln\ln N}{\ln N}}\right)\right] \tag{32}$$



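with 

$$X_s = -\beta_M\,\Phi(g_+,\theta_M)\ln N + \frac{1}{2}\ln\ln N - \ln\left[\left(1 - \exp\left(\partial_\theta\Phi(g_+,\theta_M)\right)\right)\sqrt{\frac{2\pi\theta_M(1-\theta_M)}{\beta_M}}\right] + o(1)$$

$$X_n = \left(-\beta_M\,\Phi(g,\theta_M) + 1\right)\ln N - \frac{1}{2}\ln\ln N - \ln\left[\left(1 - \exp\left(-\partial_\theta\Phi(g,\theta_M)\right)\right)\sqrt{2\pi\theta_M(1-\theta_M)\beta_M}\right] + o(1) \tag{31}$$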



where $\beta_M = \frac{M}{\ln N}$, $\theta_M = \theta\,\frac{fN}{M}$ and $\Phi$ is given by Eq. (13). In the 
calculations for $N \to +\infty$ discussed in the previous sections we kept 
only the dominant term in $\ln N$, which yields Eqs. (19) and (20). 
In the above equations, the first order corrections scale as 
$\frac{\ln\ln N}{\ln N}$, which has a dramatic effect on the storage capacity of 
finite networks. In Figure 4a,b, we plot $\overline{P_{ne}}$ (where the bar denotes 
an average over the distribution of $M$) as a function of the age of 
the pattern, and compare this with numerical simulations. It is plotted 



where $c_1$ and $c_2$ are two constants (see Methods). 

If we take fluctuations in the number of selective neurons 
into account, it introduces other finite-size effects as can be 
seen from Eqs. (43) and (44) in the Methods section. These 
fluctuations can be discarded if $|-\beta\Phi(g_+,\theta)| \gg \sqrt{\frac{\beta}{\ln N}}\,\left|\ln\frac{1-\theta}{1-g_+}\right|$ 
and $|1 - \beta\Phi(g,\theta)| \gg \sqrt{\frac{\beta}{\ln N}}\,\left|\ln\frac{1-\theta}{1-g}\right|$. In Figure 4d we plot $\overline{P_{ne}}$ for 
different values of $N$. We see that finite size effects are even 
stronger in this case. 

To plot the curves of Figure 4, we chose parameters to be those 
that optimize storage capacity for infinite network sizes. When $N$ 
is finite, those parameters are no longer optimal. To optimize 
parameters at finite $N$, since the probability of error as a function 
of age is no longer a step function, it is not possible to find the last 
pattern stored with probability one. Instead we define the capacity 
$P_c$ as the pattern age for which $P_{ne} = \frac{1}{2}$. Using Eqs. (31) and 
performing an average over the distribution of $M$, we find 
parameters optimizing pattern capacity for fixed values of $\delta$. 
Results are shown on Figure 5a,b for $N = 10,000$ and $N = 50,000$. 
We show the results for the different approximations used to 
model the neural fields: the blue line is the binomial approximation, 
the cyan line the gaussian approximation and the magenta 
one is a gaussian approximation with a covariance term that takes 
into account correlations between synapses (see Methods and 
Figure 4. Finite size effects. Shown is $P_{ne}$, the probability that a tested pattern of a given age is stored without errors, for the SP 
model. a. $P_{ne}$ as a function of the age of the tested pattern. Parameters are those optimizing capacity at $N \to +\infty$, results are for simulations 
(blue line) and calculations with a binomial approximation of the fields distributions (magenta) and a gaussian approximation (red); $P_{ne}$ is averaged 
over different values of $M$, the number of selective neurons in the tested pattern (magenta line). b. Same for $N = 50,000$. c. $P_{ne}$ as a function of a 
scaled version of pattern age (see text for details), fluctuations in $M$ are discarded on this plot. d. Same as c with an average of $P_{ne}$ over different $M$. 
doi:10.1371/journal.pcbi.1003727.g004 



[19,20]). For $f < 1/\sqrt{N}$ the storage capacity of simulated networks 
(black crosses) is well predicted by the binomial approximation 
while the gaussian approximations over-estimate capacity. For 
$f > 1/\sqrt{N}$, the correlations between synapses can no longer be 
neglected [17]. The gaussian approximation with covariance 
captures the drop in capacity at large $f$. 

For $N = 10,000$, the SP model can store a maximum of 
$P_c = 7,800$ patterns at a coding level $f = 0.0015$ (see blue curve in 
Figure 5c). As suggested in Figures 4c,d, the capacity of finite 
networks is strongly reduced compared to the capacity predicted for 
infinite size networks. More precisely, if the network of size 
$N = 10,000$ had the same information capacity as the infinite size 
network (27), it would store up to $P = 70,000$ patterns at coding 
level $f = 0.0007$. Part of this decrease in capacity is avoided if we 
consider patterns that have a fixed number $fN$ of selective 
neurons. This corresponds to the red curve in figure 4c. For fixed 
sizes the capacity is approximately twice as large. Note that 
finite-size effects tend to decrease as the coding level increases. In 
Figure 5c, / = 5.10^'', and the capacity is 3% of the value 
predicted by the large limit calculation. The ratio of actual to 
asymptotic capacities increases to 10% at/=1.10^' and 21% at 
/=1.10^^. In Figure 5d, we do the same analysis for the MP 



model with $N$ = 10,000. Here we have also optimized all the parameters, except for the depression-potentiation ratio, which is set to $\delta = 1$, ensuring that the network has the palimpsest property and the ability to deal with noisy patterns. For $N$ = 10,000, the MP model with $\delta = 1$ can store up to $P_c$ = 70,000 patterns at $f = 0.001$ (versus $P_c$ = 7,800 at $f = 0.0015$ for the SP model). One can also compute the optimized capacity for a given noise level. At $x = 0.1$, $P_c$ = 20,900 for $f = 0.0012$ and $\delta = 4.3$; at $x = 0.2$, $P_c$ = 8,900 for $f = 0.0018$ and $\delta = 6.9$.

Storage capacity with errors 

So far, we have defined the storage capacity as the number of 
patterns that can be perfectly retrieved. However, it is quite 
common for attractor neural networks to have stable fixed point 
attractors that are close to, but not exactly equal to, patterns that 
are stored in the connectivity matrix. It is difficult to estimate 
analytically the stability of patterns that are retrieved with errors as 
it requires analysis of the dynamics at multiple time steps. We 
therefore used numerical simulations to check whether a tested 
pattern is retrieved as a fixed point of the dynamics at a sufficiently low error level. To quantify the degree of error, we introduce the overlap between the network fixed point $\sigma^*$ and the tested pattern $\xi^\mu$, with $M$ selective neurons




Figure 5. Capacity at finite $N$. a,b. $P_c$ as a function of $f$ for the SP model and $N = 10^4$, $5 \cdot 10^4$. Parameters are chosen to optimize capacity under the binomial approximation. Shown are the results of the Gaussian approximation without covariance (cyan) and with covariance (magenta) for these parameters. c. Optimized $P_c$ as a function of $f$ for the SP model at $N$ = 10,000. The blue curve is for patterns with fluctuations in the number of selective neurons. The red curve is for the same number of selective neurons in all patterns. The black curve is the number of patterns that would be stored if the network were storing the same amount of information as in the case $N \to +\infty$. d. Same for the MP model, where parameters have been optimized, but the depression-potentiation ratio is fixed at $\delta = 1$.
doi:10.1371/journal.pcbi.1003727.g005



In Figure 6a we show $P_c(m)$, the number of fixed-point attractors that have an overlap larger than $m$ with the corresponding stored pattern, for $m = 1$, $m = 0.99$ and $m = 0.7$. Note that only a negligible number of tested patterns lead to fixed points with $m$ smaller than 0.7, for $N$ = 10,000 neurons. Considering fixed points with errors leads to a substantial increase in capacity, e.g. for $f = 0.0018$ the capacity increases from $P_c(m = 1)$ = 7,800 to $P_c(m = 0.7)$ = 10,400. In Figure 6b, we quantify the information capacity in bits stored per synapse, defined as in Eq. (6), $i = P_c\,(-f\log_2 f - (1-f)\log_2(1-f))/N$. Note that when retrieval is not always perfect this expression is only an approximation of the true information content. The coding level that optimizes the information capacity in bits per synapse $i$ is larger ($f_{opt} \simeq 0.003$) than the one that optimizes the number of stored patterns $P_c$ ($f_{opt} \simeq 0.002$), since the information content of individual patterns increases with $f$ (for $f < 1/2$). Finally, note that the information capacity is close to its optimum in a broad range of coding levels, up to $f \simeq 0.01$.
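As a worked illustration of the definition of information capacity used above, the following minimal sketch evaluates $i = P_c\,H(f)/N$, with $H(f)$ the binary entropy in bits, for the pattern counts quoted in the text. The function names are illustrative choices, not taken from the paper.

```python
import math


def binary_entropy_bits(f: float) -> float:
    """Entropy in bits of a binary variable that equals 1 with probability f."""
    if f <= 0.0 or f >= 1.0:
        return 0.0
    return -f * math.log2(f) - (1.0 - f) * math.log2(1.0 - f)


def information_per_synapse(num_patterns: float, f: float, n_neurons: int) -> float:
    """Bits per synapse, i = P_c * H(f) / N, for a fully connected network."""
    return num_patterns * binary_entropy_bits(f) / n_neurons


if __name__ == "__main__":
    n = 10_000
    f = 0.0018
    # Pattern counts quoted in the text for perfect retrieval and for m > 0.7.
    for label, p_c in (("m = 1", 7_800), ("m > 0.7", 10_400)):
        i = information_per_synapse(p_c, f, n)
        print(f"{label}: P_c = {p_c}, i = {i:.4f} bits/synapse")
```

Because each pattern carries about $N H(f)$ bits and there are $N^2$ synapses, the factor $1/N$ is what turns a pattern count into bits per synapse.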



Increase in capacity with inhibition 

As we have seen above, the fluctuations in the number of selective neurons in each pattern lead to a reduction in storage capacity in networks of finite size (e.g. Figure 5c,d). The detrimental effects of these fluctuations can be mitigated by adding a uniform inhibition $\eta$ to the network [19]. Using a simple instantaneous and linear inhibitory feed-back, the local fields become

$$h_i^\mu = \sum_{k=1}^{N} W_{ik}\,\xi_k^\mu - \eta \sum_{k=1}^{N} \xi_k^\mu \tag{34}$$

For infinite size networks, adding inhibition does not improve storage capacity since fluctuations in the number of selective neurons vanish in the large $N$ limit. However, for finite size networks, minimizing those fluctuations leads to a substantial increase in storage capacity. When testing the stability of pattern $\xi^\mu$, if the number of selective neurons is unknown, the variance of the field on non-selective neurons is $Nf(g - 2\eta g + \eta^2)$, and $Nf(g_+ - 2\eta g_+ + \eta^2)$ for selective neurons (for small $f$). The variance for non-selective neurons is minimized if $\eta = g$, yielding the variance obtained with fixed size patterns. The same holds for selective neurons at $\eta = g_+$. Choosing a value of $\eta$ between $g$ and
[Figure 6 panel: SP model, N = 10,000. Legend: perfect retrieval, m = 1; stable patterns with errors, m > 0.99; stable patterns with errors, m > 0.7; binomial approximation, m = 1.]




Figure 6. Storage capacity with errors in the SP model. Instead of counting only patterns that are perfectly retrieved, patterns that lead to fixed points of the dynamics overlapping significantly (see text for the definition of the overlap) with the tested memory pattern are also counted. Simulations are done with the same parameters as in Figure 5a. a. $P_c$ as a function of $f$. Blue crosses correspond to fixed points that are exactly the stored patterns. Red triangles correspond to fixed points that have an overlap larger than 0.99, and brown circles an overlap larger than 0.7. b. Same as a, but instead of quantifying storage capacity with $P_c$ it is done with $i = P_c\,(-f\log_2 f - (1-f)\log_2(1-f))/N$.
doi:10.1371/journal.pcbi.1003727.g006



$g_+$ brings the network capacity towards that of fixed size patterns. In Figure 7a, we show the storage capacity as a function of $f$ for these three scenarios. Optimizing the inhibition $\eta$ increases the maximal capacity by 28% (green curve) compared to a network with no inhibition (blue curve). The red curve is the capacity without pattern size fluctuations. Inhibition increases the capacity from $P_c$ = 7,800 at $f = 0.0018$ to $P_c$ = 12,000. In Figure 7b, information capacity measured in bits per synapse is shown as a function of $f$ in the same three scenarios. Note again that for $f = 1/\sqrt{N} = 0.01$, the capacity is quite close to the optimal capacity.
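The effect of the inhibition strength on the field variance can be checked numerically. The sketch below uses the small-$f$ variance per input, $p - 2\eta p + \eta^2$ (the variance divided by $Nf$), where $p$ stands for $g$ or $g_+$, and searches a grid of $\eta$ values for the minimum. The numerical values of $g$ and $g_+$ are illustrative assumptions, not parameters from the paper.

```python
def field_variance_per_input(p_on: float, eta: float) -> float:
    """Small-f field variance divided by N*f: p_on - 2*eta*p_on + eta**2."""
    return p_on - 2.0 * eta * p_on + eta ** 2


def best_eta_on_grid(p_on: float, steps: int = 1000) -> float:
    """Grid search over eta in [0, 1] for the smallest field variance."""
    grid = [k / steps for k in range(steps + 1)]
    return min(grid, key=lambda eta: field_variance_per_input(p_on, eta))


if __name__ == "__main__":
    g, g_plus = 0.2, 0.6  # hypothetical potentiation probabilities
    for name, p_on in (("non-selective", g), ("selective", g_plus)):
        eta = best_eta_on_grid(p_on)
        var_min = field_variance_per_input(p_on, eta)
        # At the minimum the variance equals the fixed-size value p*(1-p).
        print(f"{name}: eta* = {eta:.3f}, variance = {var_min:.4f}, "
              f"p(1-p) = {p_on * (1.0 - p_on):.4f}")
```

Since the two minima sit at different values of $\eta$, a single uniform inhibition can only be a compromise between them, which is why an intermediate value is chosen.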



Discussion 

We have presented an analytical method to compute the storage capacity of networks of binary neurons with binary synapses in the sparse coding limit. When applied to the classic Willshaw model, in the infinite $N$ limit, we find a maximal storage capacity of $\ln 2 = 0.69$ bits/synapse, the same as found in previous studies, although with a different definition adapted to recurrent networks, as discussed in the section 'Willshaw model'. We then used this method to study the storage capacity of a network with binary synapses and stochastic learning, in the single presentation (SP)







Figure 7. Storage capacity optimized with inhibition in the SP model. Blue is for a fixed threshold and fluctuations in the number of selective neurons per pattern. Green, the fluctuations are minimized using inhibition. Red, without fluctuations in the number of selective neurons per pattern. a. Number of stored patterns as a function of the coding level $f$. b. Stored information in bits per synapse, as a function of $f$.
doi:10.1371/journal.pcbi.1003727.g007
scenario [17]. The main advantage of this model, compared to the Willshaw model, is its palimpsest property, which allows it to do on-line learning in an ever changing environment. Amit and Fusi showed that the optimal storage capacity was obtained in the sparse coding limit, $f \propto \ln N / N$, and with a balance between the effects of depression and potentiation. The storage capacity of this network has been further studied for finite size networks in [19,20]. We have complemented this work by computing analytically the storage capacity in the large $N$ limit. The optimal capacity of the SP model is 0.083 bits/synapse, which is about 9 times lower than that of the Willshaw model. This decrease in storage capacity is similar to the decrease seen in palimpsest networks with continuous synapses: for example, in the Hopfield model the capacity is about 0.14 bits/synapse, while in a palimpsest version the capacity drops to about 0.05 bits/synapse. The reason for this decrease is that the most recently seen patterns have large basins of attraction, while older patterns have smaller ones. In the Willshaw model, all patterns are equivalent, and therefore they all have vanishing basins of attraction at the maximal capacity.

We have also studied the network in a multiple presentation (MP) scenario, in which patterns presented to the network are noisy versions of a fixed set of prototypes, in the slow learning limit in which transition probabilities go to zero [23]. In the extreme case in which presented patterns are the prototypes, all synaptic weights are initially at zero, and the synapses do not experience depression, this model is equivalent to the Willshaw model with a storage capacity of 0.69 bits/synapse, which is about 9 times larger than the capacity of the SP model. A more interesting scenario is when depression is present. In this case, the network has generalization properties (it can learn prototypes from noisy versions of them), as well as palimpsest properties (if patterns drawn from a new set of prototypes are presented, it will eventually replace the previous set with the new one). We have quantified the trade-off between generalization and storage capacity (see Figure 3d). For instance, if the noisy patterns have 80% of their selective neurons in common with the prototypes to be learned, the storage capacity is decreased from 0.69 to 0.12 bits/synapse.

A key step in estimating storage capacity is deriving an accurate approximation for the distribution of the inputs neurons receive. These inputs are the sum of a large number of binary variables, so the distribution is a binomial if one can neglect the correlations between these variables induced by the learning process. Amit and Fusi [17] showed that these correlations can be neglected when $f \ll 1/\sqrt{N}$. Thus, we expect the results with the binomial approximation to be exact in the large $N$ limit. We have shown that a Gaussian approximation of the binomial distribution gives inaccurate results in the sparse coding limit, because the capacity depends on the tail of the distribution, which is not well described by a Gaussian. For larger coding levels ($f \gtrsim 1/\sqrt{N}$), the binomial approximation breaks down because it does not take into account correlations between inputs. Following [19] and [20], we use a Gaussian approximation that includes the covariance of the inputs, and show that this approximation captures the simulation results well in this coding level range.
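The difference between the binomial tail and its Gaussian approximation can be made concrete with a small numerical example. The sketch below compares the exact probability that a sum of $M$ independent binary inputs reaches a high threshold with the Gaussian estimate using the same mean and variance. The values $M = 20$, $g = 0.2$ and threshold 15 are illustrative assumptions, chosen only to mimic the sparse regime where $M$ is of order $\ln N$.

```python
import math


def binomial_upper_tail(m: int, p: float, s_min: int) -> float:
    """Exact P(S >= s_min) for S ~ Binomial(m, p)."""
    return sum(math.comb(m, s) * p ** s * (1.0 - p) ** (m - s)
               for s in range(s_min, m + 1))


def gaussian_upper_tail(m: int, p: float, s_min: int) -> float:
    """Gaussian estimate of P(S >= s_min) with matched mean and variance."""
    mean = m * p
    std = math.sqrt(m * p * (1.0 - p))
    return 0.5 * math.erfc((s_min - mean) / (std * math.sqrt(2.0)))


if __name__ == "__main__":
    m, g, threshold = 20, 0.2, 15  # hypothetical values
    exact = binomial_upper_tail(m, g, threshold)
    approx = gaussian_upper_tail(m, g, threshold)
    print(f"binomial tail = {exact:.3e}")
    print(f"gaussian tail = {approx:.3e}")
    print(f"ratio         = {exact / approx:.1f}")
```

In this example the Gaussian estimate is smaller than the exact tail by more than two orders of magnitude, so it underestimates the probability that a non-selective neuron is wrongly activated, consistent with the overestimated capacity reported for the Gaussian approximation at low coding levels.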

We computed storage capacities for two different learning scenarios. Both are unsupervised, involve a Hebbian-type plasticity rule, and allow for online learning (provided patterns are presented multiple times for the MP model). It is of interest to compare the performance of these two particular scenarios with known upper bounds on storage capacity. For networks of infinite size with binary synapses such a bound has been derived using the Gardner approach [25]. In the sparse coding limit, this bound is $\simeq 0.29$ bits/synapse with random patterns (in which the number of selective neurons per pattern fluctuates), and $\simeq 0.45$ bits/synapse if patterns have a fixed number of selective neurons [26]. We found a capacity of $i_{SP} = 0.083$ bits/synapse for the SP model and $i_{MP} = 0.69$ bits/synapse for the MP model, obtained both for patterns with fixed and variable numbers of selective neurons. The result for the MP model seems to violate the Gardner bound. However, as noticed by Nadal [21], one should be cautious in comparing these results: in our calculations we have required that a given pattern is stored perfectly with probability one, while the Gardner calculation requires that all patterns are stored perfectly with probability one. As mentioned in the section 'Willshaw model', the capacity of the Willshaw and MP models drops to $i_{opt} = 0.23$ bits/synapse in the case of fixed-size patterns, if one insists that all patterns should be stored perfectly, which is now consistent with the Gardner bound. This means that the MP model is able to reach a capacity which is roughly half the Gardner bound, a rather impressive feat given the simplicity of the rule. Note that supervised learning rules can get closer to these theoretical bounds [27].

We have also studied finite-size networks, in which we defined the capacity as the number of patterns for which the probability of exact retrieval is at least 50%. We found that networks of reasonable sizes have capacities that are far from the large $N$ limit.
For networks of sizes lO'' — 10* storage capacities are reduced by a 
factor 3 or more (see Figure 4). These huge finite size effects can be understood by the fact that the leading order corrections in the large $N$ limit are in $\ln\ln N/\ln N$, and so can never be neglected unless $N$ is an astronomical number (see Methods). A large part of the decrease in capacity when considering finite-size networks is due to fluctuations in the number of selective neurons from pattern to pattern. In the last section, we have used inhibition to minimize the effect of these fluctuations. For instance, for a network of $N$ = 10,000 neurons learning in one shot, inhibition increases the capacity from $P$ = 7,800 to $P$ = 12,000. For finite size networks, memory patterns that are not perfectly retrieved can still lead to fixed points where the activity is significantly correlated with the memory patterns. We have investigated with simulations how allowing errors in the retrieved patterns modifies storage capacity. For $N$ = 10,000, the capacity increases from $P$ = 7,800 to $P$ = 10,400, i.e. by approximately 30%.
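To see why such corrections cannot be neglected at realistic sizes, one can evaluate the ratio $\ln\ln N/\ln N$ directly. The sketch below assumes that this ratio sets the relative size of the leading correction, as suggested by the $\ln\ln N$ terms that appear next to the $\ln N$ terms in the Methods. The chosen values of $N$ are illustrative.

```python
import math


def loglog_over_log(n: float) -> float:
    """Ratio ln(ln N) / ln N, assumed scale of the leading finite-size correction."""
    return math.log(math.log(n)) / math.log(n)


if __name__ == "__main__":
    for n in (1e4, 1e6, 1e9, 1e12, 1e100):
        print(f"N = {n:.0e}: ln ln N / ln N = {loglog_over_log(n):.3f}")
```

The ratio decays only logarithmically: it is still above 0.2 at $N = 10^4$ and remains a few percent even at $N = 10^{100}$.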

Our study focused on networks of binary neurons, connected through binary synapses, and storing very sparse patterns. These three assumptions allowed us to compute analytically the storage capacity of the network in two learning scenarios. An important question is how far real neural networks are from such idealized assumptions. First, the issue of whether real synapses are binary, discrete but with a larger number of states, or essentially continuous, is still unresolved, with evidence in favor of each of these scenarios [28,29,30,31,32]. We expect that having synapses with a finite number $K > 2$ of states will not strongly modify the picture outlined here [17,33,20]. Second, it remains to be investigated how these results will generalize to networks of more realistic neurons. In strongly connected networks of spiking neurons operating in the balanced mode [34,35,36,37], the presence of ongoing activity presents strong constraints on the viability of sparsely coded selective attractor states. This is because 'non-selective' neurons are no longer silent, but are rather active at low background rates, and the noise due to this background activity can easily wipe out the selective signal [35,38]. In fact, simple scaling arguments in balanced networks suggest the optimal coding level would become $f \sim 1/\sqrt{N}$ [3,39]. The learning rules we have considered in this paper lead to a vanishing information stored per synapse with this scaling. Finding an unsupervised



learning rule that achieves a finite information capacity in the large $N$ limit in networks with discrete synapses for such coding levels remains an open question. However, the results presented here show that for networks of realistic sizes, the information capacity at such coding levels is in fact not very far from the optimal one that is reached at lower coding levels (see vertical lines in Figures 5-7). Finally, the coding levels of cortical networks during delay period activity remain poorly characterized. Experiments in IT cortex [40,41,42] are consistent with coding levels of order 1%. Our results indicate that in networks of reasonable sizes, these coding levels are not far from the optimal values.

The SP and MP models investigated in this paper can be thought of as minimal models for learning in hippocampus and neocortex. The SP model bears some resemblance to the function of hippocampus, which is supposed to keep a memory of recent episodes that are learned in one shot, thanks to highly plastic synapses. The MP model relates to the function of neocortex, where a longer-term memory can be stored, thanks to repeated presentations of a set of prototypes that occur repeatedly in the environment, and perhaps during sleep under the supervision of the hippocampus. The idea that hippocampal and cortical networks learn on different time scales has been exploited in several modeling studies [43,44,45], in which the memories are first stored in the hippocampus and then gradually transferred to cortical networks. It would be interesting to extend the type of analysis presented here to coupled hippocampo-cortical networks with varying degrees of plasticity.

Methods 

Capacity calculation for infinite size networks 

We are interested in retrieving a pattern $\xi^\mu$ that has been presented during the learning phase. We set the network in this state, $\sigma = \xi^\mu$, and ask whether the network remains in this state while the dynamics (2) is running. At the first iteration, each neuron $i$ receives a field

$$h_i = \sum_{k=1}^{M} X_k^i \tag{35}$$

where $M+1$ is the number of selective neurons in pattern $\xi^\mu$, with $M = O(\ln N)$ and $N \to +\infty$. We use the standard 'Landau' notations: $a = O(F(N))$ means that $a/F(N)$ goes to a finite limit in the large $N$ limit, while $a = o(F(N))$ means that $a/F(N)$ goes to zero in the large $N$ limit. We recall that

$$g_+ = P(W_{ij} = 1 \mid \xi_i^\mu = \xi_j^\mu = 1) \quad \text{and} \quad g = P(W_{ij} = 1 \mid (\xi_i^\mu, \xi_j^\mu) \neq (1,1)).$$

Thus $X_k^i$ is a binary random variable which is 1 with probability either $g_+$ if $i$ is a selective neuron (sites $i$ such that $\xi_i^\mu = 1$), or $g$ if $i$ is a non-selective neuron (sites $i$ such that $\xi_i^\mu = 0$). Neglecting correlations between $W_{ij_1}$ and $W_{ij_2}$ (this is legitimate in the sparse coding limit we are interested in, see [17]), the $X_k^i$'s are independent and the distribution of the field on selective neurons can be written as

$$P(h_i^s = S) = \binom{M}{S} g_+^S (1-g_+)^{M-S} = \exp\left(-M\Phi\left(g_+, \frac{S}{M}\right) - \frac{1}{2}\ln\left[S\left(1-\frac{S}{M}\right)\right] - \frac{1}{2}\ln(2\pi)\right) \tag{36}$$

where we used Stirling's formula for $M, S \gg 1$, with $\Phi$ defined in (13). For non-selective neurons

$$P(h_i^n = S) = \binom{M}{S} g^S (1-g)^{M-S} = \exp\left(-M\Phi\left(g, \frac{S}{M}\right) - \frac{1}{2}\ln\left[S\left(1-\frac{S}{M}\right)\right] - \frac{1}{2}\ln(2\pi)\right) \tag{37}$$

Now write

$$P(h_i^s < fN\theta) = \sum_{S < fN\theta} P(h_i^s = S), \qquad P(h_i^n > fN\theta) = \sum_{S > fN\theta} P(h_i^n = S) \tag{38}$$

In the limit $N \to +\infty$ we are considering in this section, and if $Mg < fN\theta < Mg_+$, the sums corresponding to the probabilities $P(h_i^s < fN\theta)$ and $P(h_i^n > fN\theta)$ are dominated by their first term (corrections are made explicit in the following section). Keeping only the leading-order terms in $M$ in Eqs. (36) and (37), we have:

$$P(h_i^s < fN\theta) \sim \exp\left(-M\Phi(g_+, \theta_M)\right) \tag{39}$$

$$P(h_i^n > fN\theta) \sim \exp\left(-M\Phi(g, \theta_M)\right) \tag{40}$$

yielding Eq. (15) with $\theta_M = \theta fN/M = O(1)$. Note that with the coding levels we are considering here ($f \propto \ln N / N$), $M$ is of order $\ln N$.

When the number of selective neurons per pattern is fixed at $fN$, we choose $M\theta$ for the activation threshold and these equations become:

$$X_s = -\beta \ln N\, \Phi(g_+, \theta) + O(\ln\ln N)$$

$$X_n = \ln N\left(-\beta\Phi(g, \theta) + 1\right) + O(\ln\ln N) \tag{41}$$

where $\beta = fN/\ln N$.
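The Stirling form of the field distribution used above can be checked against the exact binomial probability. The sketch below assumes that the rate function $\Phi$ of Eq. (13), which is not reproduced in this section, has the standard Kullback-Leibler form $\Phi(g,\theta) = \theta\ln(\theta/g) + (1-\theta)\ln((1-\theta)/(1-g))$. The function names and test values are illustrative.

```python
import math


def phi(g: float, theta: float) -> float:
    """Assumed binomial rate function (Kullback-Leibler form), 0 < theta < 1."""
    return (theta * math.log(theta / g)
            + (1.0 - theta) * math.log((1.0 - theta) / (1.0 - g)))


def binomial_pmf(m: int, s: int, g: float) -> float:
    """Exact P(h = S) for a sum of m independent Bernoulli(g) inputs."""
    return math.comb(m, s) * g ** s * (1.0 - g) ** (m - s)


def stirling_pmf(m: int, s: int, g: float) -> float:
    """Stirling form: exp(-M*Phi(g, S/M) - 0.5*ln[S(1-S/M)] - 0.5*ln(2*pi)).

    Valid for 0 < s < m.
    """
    t = s / m
    exponent = (-m * phi(g, t)
                - 0.5 * math.log(s * (1.0 - t))
                - 0.5 * math.log(2.0 * math.pi))
    return math.exp(exponent)


if __name__ == "__main__":
    for m, s, g in ((20, 12, 0.5), (200, 120, 0.5), (200, 60, 0.2)):
        exact = binomial_pmf(m, s, g)
        approx = stirling_pmf(m, s, g)
        print(f"M={m}, S={s}, g={g}: exact={exact:.4e}, "
              f"stirling={approx:.4e}, ratio={approx / exact:.4f}")
```

The relative error shrinks as $M$ and $S$ grow, but for $M$ of order $\ln N$ it is not negligible, which is one reason finite-size corrections matter in practice.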

For random numbers of selective neurons we need to compute the average over $M$: $P_{ne}(N) = \sum_{M=0}^{N} P(M)\, P_{ne}(M,N)$. Since $M$ is distributed according to a binomial of average $Nf$ and variance $Nf(1-f) \simeq Nf$, for sufficiently large $Nf$, this can be approximated as $M = fN + z\sqrt{fN}$ where $z$ is normally distributed:

$$P_{ne}(N) = \int_{-\infty}^{+\infty} dz\, \frac{e^{-z^2/2}}{\sqrt{2\pi}} \exp\left(-\exp(X_s(z,N)) - \exp(X_n(z,N))\right) \tag{42}$$

with

$$X_s(z,N) = -M\Phi\left(g_+, \frac{\theta}{1+\frac{z}{\sqrt{\beta\ln N}}}\right) + O(\ln\ln N) = -\beta\ln N\left[\Phi(g_+,\theta) + \frac{z}{\sqrt{\beta\ln N}}\ln\frac{1-\theta}{1-g_+}\right] + O(\ln\ln N) \tag{43}$$

and

$$X_n(z,N) = -M\Phi\left(g, \frac{\theta}{1+\frac{z}{\sqrt{\beta\ln N}}}\right) + \ln N + O(\ln\ln N) = \ln N\left[1 - \beta\left(\Phi(g,\theta) + \frac{z}{\sqrt{\beta\ln N}}\ln\frac{1-\theta}{1-g}\right)\right] + O(\ln\ln N) \tag{44}$$

When $N$ goes to infinity, we bring the limit into the integral in Eq. (42) and obtain

$$\lim_{N\to+\infty} P_{ne}(N) = \int_{-\infty}^{+\infty} dz\, \frac{e^{-z^2/2}}{\sqrt{2\pi}} \lim_{N\to+\infty} \exp\left[-\exp(X_s(z,N)) - \exp(X_n(z,N))\right] = \Theta\left(\Phi(g_+,\theta)\right)\Theta\left(\beta\Phi(g,\theta) - 1\right) \tag{45}$$

where $\Theta$ is the Heaviside function. Thus in the limit of infinite size networks, the probability of no error is a step function. The first Heaviside function implies that the only requirement to avoid errors on selective neurons is to have a scaled activation threshold $\theta$ below $g_+$. The second Heaviside function implies that, depending on $\beta$, $\theta$ has to be chosen far enough from $g$. The above equation allows us to derive the inequalities (19) and (20).

Capacity calculation for finite-size networks

We now turn to a derivation of finite-size corrections for the capacity. Here we show two different calculations. In the first calculation, we derive Eq. (32), taking into account the leading-order correction term in Eq. (43). This allows us to compute the leading-order correction to the number of patterns $P$ that can be stored for a given set of parameters. However, it does not predict accurately the storage capacity of the large-size but finite networks that we simulated. In the second calculation presented, we focus on computing the probability of no error in a given pattern, $P_{ne}$, including a next-to-leading-order correction.

Eq. (32) is derived for a fixed set of parameters, assuming that the set of active neurons has a fixed size, and that the activation threshold $\theta$ has been chosen large enough such that the probability to have non-selective neurons activated is small. From the Stirling expansion, adding the first finite-size correction term in Eq. (41), we get

$$X_s = -\beta_M \ln N\, \Phi(g_+,\theta) + \frac{1}{2}\ln\ln N + o(\ln\ln N) \tag{46}$$

with $\beta_M = M/\ln N$. For large $N$, the number of stored patterns $P$ can be increased until $g_+(P) \to \theta$. Setting $g_+ = \theta + \epsilon$, an expansion of $\Phi$ in $\epsilon$ allows us to write

$$X_s \simeq -\ln N\, \beta_M \frac{\epsilon^2}{2\theta(1-\theta)} + \frac{1}{2}\ln\ln N \tag{47}$$

The $P$ patterns are correctly stored as long as $X_s \ll -1$. This condition is satisfied for $\epsilon > \sqrt{\dfrac{\theta(1-\theta)\ln\ln N}{\beta_M \ln N}}$. For the SP model, we can deduce which value of $P$ yields this value of $\epsilon$ (see Eq. (26)). This allows us to derive Eq. (32),



9-g J {XnNf 



/iSM(0-^)ln 



9-g 



InlnAf nnXnN 
\nN \nN 



(48) 



We now turn to a calculation of the probability of no error on a given pattern, $P_{ne}$, taking into account the next-to-leading order correction of order one, in addition to the term of order $\ln\ln N$ in Eq. (41). This is necessary to predict accurately the capacity of realistic size networks (for instance for $N$ = 10,000, $\ln\ln N \simeq 2 = O(1)$). $P_{ne}(M)$ is computed for a memory pattern with $M$ selective neurons. The estimate of $P_{ne}$ used in the figures is obtained by averaging over different values of $M$, with $M$ drawn from a binomial distribution of mean $fN$.
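The averaging over $M$ described above can be sketched with exact binomial tails instead of the asymptotic expansions. This is a simplified illustration, not the procedure used for the figures: it assumes that every neuron receives $M$ independent inputs, that a neuron is active when its field is at least the threshold $fN\theta$, and that the values of $g$, $g_+$ and $\theta$ are hypothetical.

```python
import math


def binom_pmf(n: int, k: int, p: float) -> float:
    """Binomial probability P(K = k) for K ~ Binomial(n, p)."""
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)


def p_no_error_fixed_m(n_neurons: int, m: int, g: float, g_plus: float,
                       threshold: float) -> float:
    """P_ne(M) = (1 - P(h_s < T))^M * (1 - P(h_n >= T))^(N - M)."""
    t = math.ceil(threshold)
    # Selective neuron fails if its field falls below the threshold.
    p_sel_low = min(1.0, sum(binom_pmf(m, s, g_plus)
                             for s in range(0, min(t, m + 1))))
    # Non-selective neuron fails if its field reaches the threshold.
    p_non_high = min(1.0, sum(binom_pmf(m, s, g) for s in range(t, m + 1)))
    return (1.0 - p_sel_low) ** m * (1.0 - p_non_high) ** (n_neurons - m)


def p_no_error_averaged(n_neurons: int, f: float, g: float, g_plus: float,
                        theta: float, width: float = 6.0) -> float:
    """Average P_ne(M) over M ~ Binomial(N, f), with threshold f*N*theta."""
    mean = n_neurons * f
    std = math.sqrt(mean * (1.0 - f))
    m_lo = max(1, int(mean - width * std))
    m_hi = min(n_neurons, int(mean + width * std) + 1)
    threshold = mean * theta
    total = 0.0
    weight = 0.0
    for m in range(m_lo, m_hi + 1):
        w = binom_pmf(n_neurons, m, f)
        total += w * p_no_error_fixed_m(n_neurons, m, g, g_plus, threshold)
        weight += w
    return total / weight


if __name__ == "__main__":
    n, f = 10_000, 0.0015
    g, g_plus, theta = 0.05, 0.95, 0.6  # hypothetical parameters
    fixed = p_no_error_fixed_m(n, round(n * f), g, g_plus, n * f * theta)
    averaged = p_no_error_averaged(n, f, g, g_plus, theta)
    print(f"P_ne with fixed M = fN : {fixed:.4f}")
    print(f"P_ne averaged over M   : {averaged:.4f}")
```

Comparing the two printed values for a given parameter set gives a direct feel for how fluctuations in the number of selective neurons affect retrieval when the threshold is fixed.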

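We first provide a more detailed expansion of the sums in Eq. (38). Setting $S = fN\theta + k$, we use the Taylor expansions:

$$M\Phi\left(g, \theta_M + \frac{k}{M}\right) = M\Phi(g,\theta_M) + k\frac{\partial\Phi}{\partial\theta}(g,\theta_M) + \frac{k^2}{2M}\frac{\partial^2\Phi}{\partial\theta^2}(g,\theta_M) + O\left(\frac{k^3}{M^2}\right) \tag{49}$$

$$\ln\left[S\left(1-\frac{S}{M}\right)\right] = \ln\left(M\theta_M(1-\theta_M)\right) + O\left(\frac{k}{M}\right) \tag{50}$$

where $\theta_M = fN\theta/M$. Using (37), the sum over $S > fN\theta$ can be rewritten as a sum over $k$, running from 0 to $M - fN\theta$, of exponentials of the $k$-dependent terms of Eq. (49) (Eq. (51)).

In the cases we consider, we will always have $\frac{\partial\Phi}{\partial\theta}(g,\theta_M) \neq 0$, so that we can consider only the term of order 1 in $M$. The sum is now geometric, and we obtain

$$\sum_{S > fN\theta} \frac{P(h_i^n = S)}{P(h_i^n = fN\theta)} = \frac{1}{1 - \exp\left(-\frac{\partial\Phi}{\partial\theta}(g,\theta_M)\right)} + o(1) \tag{52}$$

The same kind of expansion can be applied for the selective neurons. Again, if we are in a situation where $\frac{\partial\Phi}{\partial\theta}(g_+,\theta_M) \neq 0$,

$$\sum_{S < fN\theta} \frac{P(h_i^s = S)}{P(h_i^s = fN\theta)} = \frac{1}{1 - \exp\left(\frac{\partial\Phi}{\partial\theta}(g_+,\theta_M)\right)} + o(1) \tag{53}$$

When $g_+$ is close to $\theta$, and thus $\frac{\partial\Phi}{\partial\theta}(g_+,\theta_M) \simeq 0$, the linear term in $k$ no longer dominates and we are left with a sum controlled by the quadratic term $\frac{k^2}{2M}\frac{\partial^2\Phi}{\partial\theta^2}(g_+,\theta_M)$ of Eq. (49) (Eqs. (54) and (55)).

When $g_+$ is too close to $\theta$, which is the case for the optimal parameters in the large $N$ limit, we need to use (55). It only contributes a term of order $\ln\ln N$ in $X_s$ and does not modify our results. In Figures 6-7, we use (53), which gives from (38) and (36), with $\beta_M = M/\ln N$,

$$P(h_i^s < fN\theta) = \exp\left[-\beta_M\Phi(g_+,\theta_M)\ln N - \frac{1}{2}\ln\ln N - \frac{1}{2}\ln\left(2\pi\beta_M\theta_M(1-\theta_M)\left[1-\exp\left(\frac{\partial\Phi}{\partial\theta}(g_+,\theta_M)\right)\right]^2\right)\right] \tag{56}$$

and, from (37) and (52):

$$P(h_i^n > fN\theta) = \exp\left[-\beta_M\Phi(g,\theta_M)\ln N - \frac{1}{2}\ln\ln N - \frac{1}{2}\ln\left(2\pi\beta_M\theta_M(1-\theta_M)\left[1-\exp\left(-\frac{\partial\Phi}{\partial\theta}(g,\theta_M)\right)\right]^2\right)\right] \tag{57}$$

The probability of no error is

$$P_{ne} = \left(1 - P(h_i^s < fN\theta)\right)^{M}\left(1 - P(h_i^n > fN\theta)\right)^{N-M} \simeq \exp\left(-\exp X_s - \exp X_n\right) \tag{58}$$

which leads to Eqs. (31):

$$X_s = -\beta_M\Phi(g_+,\theta_M)\ln N + \frac{1}{2}\ln\ln N - \frac{1}{2}\ln\left(\left[1-\exp\left(\frac{\partial\Phi}{\partial\theta}(g_+,\theta_M)\right)\right]^2 \frac{2\pi\theta_M(1-\theta_M)}{\beta_M}\right) + o(1)$$

$$X_n = \left(-\beta_M\Phi(g,\theta_M) + 1\right)\ln N - \frac{1}{2}\ln\ln N - \frac{1}{2}\ln\left(\left[1-\exp\left(-\frac{\partial\Phi}{\partial\theta}(g,\theta_M)\right)\right]^2 2\pi\theta_M(1-\theta_M)\beta_M\right) + o(1)$$

Gaussian approximation of the fields distribution

For a fixed number $M+1$ of selective neurons in pattern $\xi^\mu$, approximating the distribution of the fields on background neurons $h_i^n$ and selective neurons $h_i^s$ with a Gaussian distribution gives:

$$P_G(h_i^n = S) = \frac{1}{\sqrt{2\pi\sigma_n^2}}\exp\left(-\frac{(S-\mu_n)^2}{2\sigma_n^2}\right) \tag{59}$$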



where

$$\mu_n = Mg, \qquad \sigma_n^2 = Mg(1-g) \tag{60}$$

and

$$P_G(h_i^s = S) = \frac{1}{\sqrt{2\pi\sigma_s^2}}\exp\left(-\frac{(S-\mu_s)^2}{2\sigma_s^2}\right) \tag{61}$$

where

$$\mu_s = Mg_+, \qquad \sigma_s^2 = Mg_+(1-g_+) \tag{62}$$

The probabilities that those fields are on the wrong side of the threshold are:

$$P(h_i^n > fN\theta) = \int_{fN\theta}^{+\infty} P_G(h_i^n = z)\,dz \tag{63}$$

and

$$P(h_i^s < fN\theta) = \int_{-\infty}^{fN\theta} P_G(h_i^s = z)\,dz \tag{64}$$

Following the same calculations as presented above, and keeping only terms that are relevant in the limit $N \to +\infty$, the probability that there is no error is given by:

$$\lim_{N\to+\infty} P_{ne}(N) = \Theta\left(\Phi^G(g_+,\theta)\right)\Theta\left(\beta\Phi^G(g,\theta) - 1\right) \tag{65}$$

where the rate function is

$$\Phi^G(x,\theta) = \frac{(\theta - x)^2}{2x(1-x)} \tag{66}$$

Calculations with the binomial versus the Gaussian approximation differ only in the form of $\Phi$. Finite size terms can be taken into account in the same way as in the previous Methods section for the binomial approximation.
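Since the two approximations differ only through the rate function, comparing the two functions directly shows where the Gaussian approximation is expected to fail. The sketch below assumes that the binomial rate function $\Phi$ of Eq. (13) has the standard Kullback-Leibler form, and uses $\Phi^G(x,\theta) = (\theta-x)^2/(2x(1-x))$ for the Gaussian case. The numerical values are illustrative.

```python
import math


def phi_binomial(x: float, theta: float) -> float:
    """Assumed binomial rate function (Kullback-Leibler form)."""
    return (theta * math.log(theta / x)
            + (1.0 - theta) * math.log((1.0 - theta) / (1.0 - x)))


def phi_gaussian(x: float, theta: float) -> float:
    """Gaussian rate function: (theta - x)^2 / (2 x (1 - x))."""
    return (theta - x) ** 2 / (2.0 * x * (1.0 - x))


if __name__ == "__main__":
    # Close to the mean the two functions agree; far in the tail they differ.
    for x, theta in ((0.5, 0.51), (0.2, 0.3), (0.2, 0.75)):
        b = phi_binomial(x, theta)
        gauss = phi_gaussian(x, theta)
        print(f"x={x}, theta={theta}: binomial={b:.5f}, "
              f"gaussian={gauss:.5f}, ratio={gauss / b:.3f}")
```

For a threshold far above a small $g$, the Gaussian rate function is larger than the binomial one in this example, so the Gaussian approximation assigns too small a probability to errors on non-selective neurons.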

In all the above calculations we assumed that fields are sums of independent random variables (35). For small $f$, correlations are negligible [17,19]. It is possible to compute the covariances between the terms of the sum (see Eq. (3.9) in [19]), and take them into account in the Gaussian approximation. This can be done using

$$\sigma_n^2 = Mg(1-g) + M(M-1)\gamma$$

$$\sigma_s^2 = Mg_+(1-g_+) + M(M-1)\gamma$$

in Eqs. (59),(61), where



y=f 



(67) 



(68) 



(69) 